Modelling uncertainty and correlation in multivariate linear regression using an affine transformation from Gaussian parameter space to the linear domain.

This article explores the ability of Bayesian models to learn the generative distribution properties of linear data using multivariate Gaussian priors, and hence gain more informative posteriors than we normally obtain from line-fitting regression. Implemented in Python with PyMC3.

Overview

It is an obvious statement, though not always appreciated, that Bayesian linear regression is not fitting points to a line; it is fitting points to Gaussian distributions.

The line is an abstraction. The apparent scatter arises because each point lies on its own surface, determined by a starting point and trend drawn from these Gaussians. We cannot know the true surface each point lies on unless we have labelled trajectories over time, but we can infer the probability distributions for these starting points and trends. (Here we use 'line' and 'surface' interchangeably to mean hypersurfaces in one or two dimensions.)

Ordinary Least Squares

The conventional picture of linear regression is of a line drawn through some scatter plot of points. Classic ordinary least squares (OLS) assumes there is a single 'best' line and that the points do not lie exactly on the line because of errors in observation and other unobserved independent random effects. Confidence in the solution is represented in terms of the distance of the points from this 'best' line.

Bayesian Regression

The Bayesian view is that there is not a single 'best' line but a family of lines each of which carries a certain probability of being the 'true' line, based on the given data and priors. Points are not expected to lie on a single surface because there are errors in observations, which can be modelled, and because the slope and intercept of the 'true' line can only be known approximately, and hence these are modelled as distributions.

Bayesian Regression - a deterministic viewpoint

We want to explore in this article a more ambitious target, based on learning the true shape of the generative distributions. Setting aside observational errors for now, we suggest that the physical reality underlying the Bayesian interpretation is not that there is a single 'true' line which we are trying to find from inherently noisy data, but that there are many actual 'true' trajectories, each of which generates linear data points in a deterministic way. The scatter in the data is not inherent but is a consequence of our lack of knowledge of the start conditions for each trajectory, of which there are many in our observed system. The shape of the distribution of the trajectories' initial conditions tells us about the complex underlying processes which generate them.

The objective in Bayesian linear regression is not just to get the 'best' line - it is very possible to obtain an incorrect but well-fitting model - our objective is to model as accurately as possible the properties of the distributions that represent the physical processes from which our observed data is drawn. This includes separating noise terms from other uncertainties so that we can understand both the underlying deterministic and stochastic physical processes, giving us not just a descriptive but an explanatory model which transcends the sample domain. The probability distributions we use in solving these problems are not artifacts or tricks, but should be interpreted as revealing something true about the physical priors.

We can interpret linear models as a special case of Gaussian models, with a constraint across datapoints in a particular dimension, in a similar way that Gaussian Processes (GP) are Gaussian models constrained by a smoothing kernel across adjacent datapoints. The line, or curve in GP, or surface in the multivariate case, is an abstraction influenced by our ability to visualize it. Then a linear mixed model is essentially a Gaussian mixture model with additional linear constraints.
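This view can be checked numerically. In a minimal sketch (all parameter values below are made up for illustration), two observations on the same random line are jointly Gaussian, with a covariance structure tied together by the shared intercept and slope:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two observation points on the same random line z = i + b*x (univariate case)
x = np.array([1.0, 3.0])
mu_i, s_i = 0.0, 0.5      # intercept ~ N(mu_i, s_i^2)
mu_b, s_b = 2.0, 0.4      # slope     ~ N(mu_b, s_b^2)

# The linear constraint makes (z1, z2) jointly Gaussian with
# cov = s_i^2 * 1 1^T + s_b^2 * x x^T
cov = s_i**2 * np.ones((2, 2)) + s_b**2 * np.outer(x, x)

# Empirical check: sample lines and evaluate each at both points
i = rng.normal(mu_i, s_i, 100_000)
b = rng.normal(mu_b, s_b, 100_000)
z = i[:, None] + np.outer(b, x)
assert np.allclose(np.cov(z.T), cov, atol=0.05)
```

The off-diagonal term is non-zero even though nothing 'noisy' links the two points: the shared line parameters are the constraint.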

Gaussian affine transformation

We use below the property that Gaussians remain Gaussians under affine transformation: if $X \sim \mathcal{N}(\mu_x,\,\Sigma_x)$ and $Y = AX + b$ for a linear transformation $A$ and constant vector $b$, then $Y$ is Gaussian with mean $\mu_y = b + A\mu_x$ and covariance $\Sigma_y = A\Sigma_x A^T$.
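This property is easy to verify by simulation (the matrices and vectors below are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Source Gaussian X with known mean and covariance
mu_x = np.array([1.0, -2.0])
sigma_x = np.array([[2.0, 0.6],
                    [0.6, 1.0]])

# Affine map Y = A x + b
A = np.array([[1.5, 0.0],
              [0.5, 2.0]])
b = np.array([3.0, -1.0])

# Analytic result: Y ~ N(b + A mu_x, A sigma_x A^T)
mu_y = b + A @ mu_x
sigma_y = A @ sigma_x @ A.T

# Empirical check by sampling
x = rng.multivariate_normal(mu_x, sigma_x, size=200_000)
y = x @ A.T + b
assert np.allclose(y.mean(axis=0), mu_y, atol=0.05)
assert np.allclose(np.cov(y.T), sigma_y, atol=0.1)
```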

0.1 Regression surface

We assume that every observable $z$ at point $x$ is evidence of an underlying model with $intercept$ and slope $β$:

$z_{obs} = intercept + x\cdotβ$

We assume that there exist probability distributions for the intercept and slope of our system:

$ I \sim \mathcal{N}(\mu_{intercept},\,\Sigma_{intercept})\,$

$ B \sim \mathcal{N}(\mu_\beta,\,\Sigma_\beta)\,$

Then $intercept$ and $β$ are drawn from these distributions and we are looking for the probability of an observation $z_{obs}$ at a point $x$ given these assumptions.

0.2 Affine transformation

For a Gaussian, an affine transformation of $B$ by $X$ of the form $Y=i + XB$, where $i$ is a constant vector, gives a new Gaussian:

$ Y \sim \mathcal{N}(\mu_y,\,\Sigma_y)\,$ where $\mu_y=i + X\mu_\beta$ and $\Sigma_y=X\Sigma_\beta X^T$

This is close to what we want, but we must also consider that our intercept is not a constant vector but is also drawn from a distribution. This has a 'defocussing' effect on $y$, since for each draw of the slope $β$ at point $x$ there will be an uncertainty in the value of $y$ corresponding to the uncertainty in the draw of the intercept. This amounts to a convolution of the $Y$ distribution with the distribution for $I$: each possible distribution of $y$ is convolved with a mean shifted by each possible draw of $intercept$, which has the effect of spreading the uncertainty in $y$:

0.3 Convolution of Intercept and Slope

$Z = Y \circledast I$

The useful property of Gaussians is that the variance of the convolution is the sum of the variances of the contributing distributions, hence we obtain the result we want for the distribution of observables $z$ at a point $x$:

$Z \sim \mathcal{N}(\mu_z,\,\Sigma_z)\,$ where $\mu_z=\mu_{intercept} + X\mu_\beta$ and $\Sigma_z=\Sigma_{intercept} + X\Sigma_\beta X^T$

In the case where $x$ is a constant vector with the same number of elements as $β$, which is our case for $z_{obs}$, we obtain

$Z \sim \mathcal{N}(\mu_{intercept} + x\cdot{\mu_β},\, \Sigma_{intercept} + x\Sigma_\beta x^T)$
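We can verify this convolved result numerically: the empirical mean and variance of $z = intercept + x\cdotβ$ over many draws should match the analytic mean and the affine variance plus the intercept variance. The parameter values below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative (made-up) distribution parameters
mu_i, sigma_i = 2.0, 0.5              # intercept ~ N(mu_i, sigma_i^2)
mu_b = np.array([1.0, -0.5])          # slope ~ N(mu_b, sigma_b)
sigma_b = np.array([[0.04, 0.01],
                    [0.01, 0.09]])
x = np.array([3.0, 2.0])              # a fixed point in the xy plane

# Analytic result: the convolution adds the intercept variance
mu_z = mu_i + x @ mu_b
var_z = sigma_i**2 + x @ sigma_b @ x

# Empirical check: z = intercept + x . beta over many draws
i = rng.normal(mu_i, sigma_i, 200_000)
beta = rng.multivariate_normal(mu_b, sigma_b, 200_000)
z = i + beta @ x
assert abs(z.mean() - mu_z) < 0.05
assert abs(z.var() - var_z) < 0.1
```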

0.4 Distributions for Affine Linear Regression

In our bivariate case, the slope $β$ is composed of $β_x$ and $β_y$ and each sample point in $X$ is a coordinate $x$, $y$, hence:

$\mu_z = \mu_{intercept} + [x, y]\cdot\mu_β = \mu_{intercept} + x \mu_{\beta x} + y\mu_{\beta y}$

$\Sigma_z = \Sigma_{intercept} + [x, y]\cdot\begin{bmatrix} \Sigma_{\beta xx} & \Sigma_{\beta xy}\\ \Sigma_{\beta yx} & \Sigma_{\beta yy} \end{bmatrix}\cdot[x, y]^T$

Also note below that for the univariate distributions $I$ and $Z$, $\Sigma_{intercept} = \sigma_{intercept} ^ 2$ and $\Sigma_z = \sigma_z ^ 2$

Implementation

Below we create a simple exploration of these ideas, solving a multivariate linear regression model using PyMC3. The article roadmap is as follows:

  1. Verify numerically that the calculated affine distribution above correctly mimics test values of $z$ calculated in the $xy$ domain from samples drawn from the test intercept and slope distributions
  2. Show that taking the samples of slope as observables, the PyMC3 model of the Gaussian distribution of $B$ yields good values for the mean and covariance, $\mu_\beta,\,\Sigma_\beta$ of this distribution
  3. Setup the PyMC3 using the affine distribution $Z$ and show that taking the samples of $z$ as observables we can obtain good values for the parameters of distributions $I$ and $B$, being $\mu_{intercept},\,\sigma_{intercept}, \mu_\beta,\,\Sigma_\beta$
  4. Compare our affine approach against the conventional linear regression approach, modelling unconnected slope and intercept priors along with a noise prior
  5. Modify the conventional approach by adding covariance and mean to slope and compare against our affine regression approach
  6. Conclusion

1. Generated sample data

Often practical discussions of linear regression start by simulating test data, adding random noise to a sample linear dataset. While this verifies the robustness of the linear model to the random error term, it does nothing to test the ability of the model to reflect the uncertainty in the intercept and gradient.

To properly test the model performance we must simulate data by drawing intercept and gradient from simulated distributions, generating linear points across this line and then adding noise. This must be done for samples across these distributions and the modelled posterior distributions must be checked to see that they correctly reproduce the parameters of the sample distributions. Hence we move away from the notion that there is a single 'true' line which we are trying to model, and instead give full status to the parameters of the generating distributions as saying something about the underlying causal behaviour.

It can clearly be seen below that, as expected, the sampled surfaces drawn from our intercept and slope distributions diverge from the mean plane with increasing values of x and y. Data observed in this domain might appear to exhibit heteroscedastic errors, but that would be an incorrect interpretation of the data. The diverging data distribution in the xy plane arises from the spatial transformation of a normal distribution to the linear domain.

For the purposes of the ongoing tests we are deliberately not adding noise to our sampled data so that we can test the hypothesis that we can uncover the true parameter distributions from the apparently noisy data. These parameters are effectively latent variables. We cannot observe them directly, but only their effect in a transformed space.

Here we calculate the observable value $z$ at each point $X$ in the plane for each sampled draw of $intercept$ and slope $β$:

$z_{sample} = intercept_{sample} + X\cdotβ_{sample}$
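This sampling step can be sketched in plain NumPy (the distribution parameters and grid below are assumptions for illustration, not the article's actual test values):

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative ground-truth distributions
mu_i, sigma_i = 1.0, 0.3
mu_b = np.array([2.0, -1.0])
sigma_b = np.array([[0.10, 0.03],
                    [0.03, 0.05]])

n_surfaces, n_points = 200, 9

# Draw one intercept and one slope per trajectory/surface
intercepts = rng.normal(mu_i, sigma_i, n_surfaces)
slopes = rng.multivariate_normal(mu_b, sigma_b, n_surfaces)

# Grid of xy points at which to evaluate each surface
grid = np.array([[x, y] for x in (0.0, 1.0, 2.0) for y in (0.0, 1.0, 2.0)])

# z_sample[s, p] = intercept_s + grid_p . beta_s  -- no noise added
z_sample = intercepts[:, None] + slopes @ grid.T
assert z_sample.shape == (n_surfaces, n_points)
```

Note that at the origin point (0, 0) the sampled values are exactly the drawn intercepts, since the slope contributes nothing there.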

1.1 Test data points

In order to visualize the behaviour of the model, we devise a set of 9 data points in the xy plane at which we will evaluate the projected line paths and predicted observations.

1.2 Constructing a Gaussian linear model using Affine transformation and Convolution

In order to use the Bayesian approach we need to form a likelihood function which will map our prior distributions onto points in the linear space of observations.

Here we use the affine transformation in 2 dimensions, which has the effect of projecting our multivariate slope and intercept distributions onto a univariate distribution which gives the probability of an observation having a certain value at each point x, y in the linear domain.

At each point x,y, the observational distribution has the following form:

$\mu_z = \mu_{intercept} + [x, y]\cdot\mu_β$

$\sigma_z = \sqrt{\sigma_{intercept} ^2 + [x, y]\cdot\begin{bmatrix} \Sigma_{\beta xx} & \Sigma_{\beta xy}\\ \Sigma_{\beta yx} & \Sigma_{\beta yy} \end{bmatrix}\cdot[x, y]^T}$
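A per-point likelihood of this form can be sketched in NumPy/SciPy (the helper name `affine_logp` and the parameter values are illustrative, not the article's actual PyMC3 code):

```python
import numpy as np
from scipy.stats import norm

def affine_logp(z, xy, mu_i, sigma_i, mu_b, sigma_b):
    """Log-density of an observation z at point xy under the affine model."""
    mu_z = mu_i + xy @ mu_b
    sd_z = np.sqrt(sigma_i**2 + xy @ sigma_b @ xy)
    return norm.logpdf(z, mu_z, sd_z)

# Illustrative parameters
mu_b = np.array([2.0, -1.0])
sigma_b = np.array([[0.10, 0.03],
                    [0.03, 0.05]])

# At the origin the observational density reduces to the intercept
# distribution alone, since the slope makes no contribution there
assert np.isclose(affine_logp(0.5, np.zeros(2), 0.5, 0.3, mu_b, sigma_b),
                  norm.logpdf(0.5, 0.5, 0.3))
```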

1.3 Testing Affine Gaussian distribution against simulated data

It is valuable to compare the analytic affine Gaussian distributions above against the result of simulating the distributions stochastically by drawing samples of the projected plane surfaces.

Each test xy point will have a set of observation values corresponding to the set of drawn intercept and slope samples. The probability density of the values at each point should correspond to the analytic affine-transformed distribution.

To test this we:

  1. Estimate the intercept mean and standard deviation, and the slope mean and covariance, of the sampled draws from the intercept and slope distributions above
  2. From these parameters use our affine transformation to obtain transformed mean and covariances at specific test xy points
  3. Using the transformed mean and covariance, plot the Gaussian distribution for each of the test points
  4. Also plot a kernel density estimate (KDE) of the stochastic observables at these test points
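The heart of this comparison can be reproduced in a small NumPy/SciPy sketch (parameters are illustrative assumptions): the KDE of the stochastic observables at a test point should track the analytic affine-transformed PDF.

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(7)

# Illustrative parameters
mu_i, sigma_i = 1.0, 0.3
mu_b = np.array([2.0, -1.0])
sigma_b = np.array([[0.10, 0.03],
                    [0.03, 0.05]])
xy = np.array([2.0, 1.0])             # one test point in the xy plane

# Analytic affine-transformed distribution at this point
mu_z = mu_i + xy @ mu_b
sd_z = np.sqrt(sigma_i**2 + xy @ sigma_b @ xy)

# Stochastic observables at the same point, and their KDE
i = rng.normal(mu_i, sigma_i, 100_000)
b = rng.multivariate_normal(mu_b, sigma_b, 100_000)
z = i + b @ xy
kde = gaussian_kde(z)

# The KDE of the samples should track the analytic PDF closely
grid = np.linspace(mu_z - 3 * sd_z, mu_z + 3 * sd_z, 13)
assert np.allclose(kde(grid), norm.pdf(grid, mu_z, sd_z), atol=0.05)
```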

1.4 Comparison of 'True' and sampled distributions

For the purposes of the tests below, we have drawn a sample set of intercepts and slopes as described above and comparisons of model performance are made against the sampled data, as though it were truth, since without informative priors on the 'truth', the inference cannot do better than reproduce the sampled distributions.

Here we compare the affine transformation of our sample data distributions at each of the test points with the distribution of observations based on those sample distributions. It can be seen that the affine transformations which generate the Sample PDFs match very closely the KDE estimates of the actual sampled observations. In particular, the origin point (0,0) is represented by the intercept distribution alone, since the slopes make zero contribution at that point.

Note: In all these charts the solid lines represent PDFs generated as normal distributions with mean and sigma calculated using the affine calculations based on $μ_i, σ_i, μ_β, Σ_β$ at the relevant point xy.

2. Modelling Gaussian distributions

If we were doing a conventional Gaussian multivariate model we would now fit these bivariate slope samples to a 2-dimensional covariance matrix (Ref https://docs.pymc.io/notebooks/LKJ.html).

First it is useful to visualize the slope distribution projected using the affine transformation to two of the sample test points in xy:

2.1 Discussion on the uncertainties in the Mean

The most significant change from the documented example is that we use a uniform prior on the slope mean. If a distribution prior such as a Normal is used, with a meta-prior on its standard deviation, the result is poorly behaved, because the sample mean is a scalar, not a distribution. On the assumption of data homogeneity, the mean of a sample from the population approaches the population mean, so if we try to fit a distribution to it we should expect failure. Our best approach is to set a uniform prior across the range of possible values, allowing a grid search for the value that best fits our overall model. We can seed this search with the computed mean of the sample data, but this does not necessarily imply that our inferred posterior will take this value.
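The grid-search intuition can be sketched in NumPy/SciPy (this is an illustration under assumed values with the standard deviation held known, not the PyMC3 model itself): under a flat prior the posterior over the mean is proportional to the likelihood, so scanning candidate means recovers the sample mean.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

# Slope samples drawn from an assumed ground truth (illustrative values)
true_mu, true_sd = 2.0, 0.4
samples = rng.normal(true_mu, true_sd, 500)

# With a flat (uniform) prior the posterior over the mean is proportional
# to the likelihood, so scan a grid of candidate means
grid = np.linspace(0.0, 4.0, 401)
loglik = np.array([norm.logpdf(samples, mu, true_sd).sum() for mu in grid])
mu_hat = grid[loglik.argmax()]

# The best-fitting candidate is (to grid resolution) the sample mean
assert abs(mu_hat - samples.mean()) < 0.02
```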

2.2 Results

Running the model with the uniform prior over 1,800 data points with 5,000 draws over 4 chains is fast, and the result can be seen to be an excellent fit to the ground truth data.

We can also see that the r_hat values are all very close to 1.00 with no divergences, indicating that the four chains converged smoothly, and the posterior traces are very well behaved across all the parameters.

3. Setting up affine linear regression scenario

In our scenario of linear regression, we cannot directly observe the draws for intercept and slope, but we can observe the evolution of a system based on these parameters. Each intercept and slope represents the starting conditions for a surface. Our system will evolve over these surfaces such that every observable data point will lie on one of these surfaces and will have a probability density of values related to its xy coordinate and the distributions of intercept and slope. The shape of the distribution at each point differs, which would seem to make inference impossible, but the distributions all share the common hidden parameters of the generative distributions. This gives us hope that we can obtain convergence in the MCMC sampling.

Note that we are intentionally not attempting to model random noise in the observational data. Our sample does not contain noise and our model does not have a prior to represent it.

3.1 Results of affine model run

We can see above that the model has converged very cleanly with no divergences and r_hat close to 1.00. Also we see that the slope and intercept means are close to the expected sample values.

Most significant, and the distinctive feature of our approach's success, is that we have obtained good estimates for both the diagonal and correlation components of the covariance matrix. When we consider that these components are only indirectly related to the observed data - they influence the slopes in x and y but cannot be directly inferred from those slopes - it is truly an aggregate behaviour of the model over many point distributions that yields these latent values.

The robustness of the convergence, given a relatively small amount of data, suggests that there is great scope for extending this model to cover mixed models with more granular structure to the intercept and slope distributions.

As we would expect the mean posterior samples lie on the expected mean plane of our distributed data:

3.2 Visualization of posterior sample points and distributions

It can be seen below that the generated posterior sample points yield distributions that are very close to the anticipated sample data distributions. In particular, the intercept is well modelled, as is the profile of variances at points distant from the origin.

4. Comparison and contrast with conventional linear regression approach

The conventional approach generally taken to linear regression over sampled n-dimensional observations X is as follows:

  1. Create HalfCauchy standard deviation priors $σ_i$, $σ_β$ for intercept and slope
  2. Create univariate normal prior for intercept $μ_i$ with mean 0 and standard deviation $σ_i$
  3. Create n-dimensional univariate normal priors on slope β with mean 0 and standard deviation $σ_β$
  4. Create HalfCauchy standard deviation prior for observations $σ_{noise}$
  5. Create univariate normal likelihood with mean $μ_o = μ_i + X\cdotβ$ and standard deviation $σ_{noise}$
  6. Condition against observables $y_{obs}$ at points $X$

The $σ_{noise}$ term is essentially an artifact to catch errors which arise because we have not properly modelled the prior distributions. This can be seen when we run this model on our sample data:

4.1 Results of conventional linear regression model

Clearly the means compare well with the target, which is to be expected as these are conditioned directly by the observables, but we have no correlation coefficients and the standard deviations $σ_i$ and $σ_{noise}$ are very large, as they must absorb all the errors that are artifacts of the weak model.
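The inflation of the pooled noise term can be reproduced in a small NumPy sketch (parameters are illustrative): fitting a single shared noise scale to affine-generated data forces it to absorb the slope uncertainty on top of the intercept spread.

```python
import numpy as np

rng = np.random.default_rng(4)

# Affine-generated data: the true variance grows away from the origin
mu_i, sigma_i = 1.0, 0.3
mu_b = np.array([2.0, -1.0])
sigma_b = np.array([[0.10, 0.03],
                    [0.03, 0.05]])
grid = np.array([[x, y] for x in (0.0, 1.0, 2.0) for y in (0.0, 1.0, 2.0)])

i = rng.normal(mu_i, sigma_i, (100_000, 1))
b = rng.multivariate_normal(mu_b, sigma_b, 100_000)
z = i + b @ grid.T                    # noise-free affine observations

# Residuals about the true mean plane at each point
resid = z - (mu_i + grid @ mu_b)

# A single pooled noise sd (what sigma_noise must become) is much larger
# than the intercept sd: it has to absorb the slope uncertainty as well
pooled_sd = resid.std()
assert pooled_sd > 1.5 * sigma_i
```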

5. Modification of conventional multivariate linear regression approach

We can try to improve the results of our conventional approach by making the following changes for the slope prior β:

  1. Include a uniform prior $μ_β$ as earlier to obtain the mean slope
  2. Instead of n-dimensional univariate normal priors on $β$, use a multivariate normal prior with covariance $Σ_β$, conditioned on the Cholesky prior as earlier.

5.1 Results of multivariate linear regression model based on non-affine approach

It is clear that this revised multivariate model has failed to converge correctly. It has not been able to handle the uncertainties in the Cholesky prior, and this has resulted in many divergences. The covariance posterior is also wrong, with close to zero correlation and incorrect variances.

This highlights the problem with the conventional approach. In the affine model we sample from a single robust likelihood which correctly models the interaction between the coordinates, intercept and slope priors: each point in the linear space is fully represented by its own Gaussian distribution, combining all the known information about the expected value at that point. The simple conventional model instead disconnects the slope prior from the linear space, and hence is unable to extract the information in the individual data points.

Since the posterior means are correct, a naive interpretation of this model would be that we have successfully matched the mean regression surface and solved the problem. But if we compare the posterior sample observations and distributions, it can be seen that the observations are scattered with a common variance across the linear space: we have lost all the resolution of the shape of the slope covariance distribution which contributed to the profile of the true sample data.

Hence our predictions of confidence intervals or credibility intervals, on which we might test a hypothesis or use as the basis of downstream analysis, are false to a significant degree.

Intercept Mean:

Intercept Standard Deviations:

Slope Means:

Slope Covariance:

Noise terms - non-affine regression artifacts

5.2 Visualizations of deficiencies in simple multivariate regression

We can see below that the posterior samples from this conventional approach do not follow the expected behaviour of our sample data, and that, drilling into the detailed distributions at each of the test points, the posterior distributions do not match the expected variances.

6. Conclusion

Machine learning is the extraction of signal from noise, not just line fitting, and our claim as Bayesians is that there is physical significance to the probability distributions of priors and posteriors.

The great power of Bayesian inference is its ability to learn the complex properties of apparently noisy data and model not just the mean but also the probability densities of the observed data. This allows for further interpretation, risk and cost analysis, decision making and planning. In many industries, such as medical research and oil exploration, it is vital to work with uncertainties in action planning based on uncertain data.

Here we have interpreted the data, and corresponding model, as a collection of true trajectories, each with fixed intercept and slope and associated multivariate probability distributions.

  1. We have shown that the uncertainties in our simulated data are correctly modelled by our prior distributions and that we can reliably reproduce the values of our simulated intercept and slope mean and covariance.

  2. Going forward we can investigate patterns of non-homogeneous linear data points that can be grouped, where points in each group are derived from the same linear trajectory, hence breaking our modelled posteriors into multimodal distributions, which is the basis of linear mixed models.

  3. We can seek to learn more granular complex models of the distributions for intercept and slope and seek to resolve these uncertainties into more fundamental uncertainties, possibly using latent variables, giving greater understanding of the underlying deterministic processes.

We do not suggest that there is a single interpretation of linear regression which is right or wrong but that we should look for an interpretation which allows the greatest richness in uncovering the properties of observable systems. By strictly implementing dynamic, deterministic, causal models, Bayesian methods give us a rich toolset for discovering the building blocks of a complex world.

"a thing cannot occur without a cause... The curve described by a simple molecule is regulated in a manner just as certain as the planetary orbits; the only difference between them is that which comes from our ignorance. ... The theory of chance consists in reducing all the events of the same kind to such as we may be equally undecided about in regard to their existence." Pierre-Simon Laplace, Essay on Probabilities, 1795.

6.1 Work in Progress

  1. Our method lends itself to multivariate Gaussian mixture models and hence we intend to continue to explore its power in this domain.
  2. We also intend to generate experimental results varying sample sizes and the effect of changing the number of observations per sample trajectory.
  3. We have modelled correlations in slope but this does not tell us anything about the shape of the sample data points in the xy domain. We will add and investigate a correlation model which allows for distributions in the x, y plane of the sample data.
  4. We intend to investigate the effect of conventional random noise added to observations and we believe we will be able to model this noise much more accurately having separated out the full uncertainty components of the intercept and slope distributions.
  5. We also intend to prove this model on existing datasets.
